Intro

NurseBridge (NB) wants to create a data driven predictive model allowing rural healthcare systems to make more informed decisions regarding compensation offerings at the most granular unit of time possible.

Data

Goal is to use this data to categorize risk of death by day of the week on a county level.

We will start with monthly data and attempt to work into the day of the week. Due to anonymization of data, days of the week are specified but not the actual dates. I.e, All Fridays of the month are bundled into a single Friday variable.

Underlying cause of death 2018-21

  • Data was pulled from the CDC website by grouping by County, Month, and Weekday.
  • Suppressed death counts were added via the CDC to protect privacy, and mask any death values number 1-9.
    • Suppressed values replaced with 5.

Glimpse

Initial imports were performed on Texas and Oklahoma ranging from 2018-21 as these two locations were specifically mentioned by NB as some of their primary areas of focus.

adj_death values are calculated by substracting COVID deaths from in-hospital deaths. crude_rate_10k values are calculated by dividing adj_death values by county_population values and multiplying by 10,000. This gives a mortality rate per 10,000 people per county.

df_wkday %>% glimpse()
## Rows: 127,104
## Columns: 9
## $ year                 <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
## $ month                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ wkday                <fct> Sunday, Monday, Tuesday, Wednesday, Thursday, Fri…
## $ state                <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, O…
## $ county               <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair, …
## $ county_code          <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001, …
## $ county_population    <dbl> 22082, 22082, 22082, 22082, 22082, 22082, 22082, …
## $ wkday_adj_deaths     <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ wkday_crude_rate_10k <dbl> 0.4528575, 0.4528575, 0.0000000, 0.4528575, 0.000…
df_monthly %>% glimpse()
## Rows: 15,888
## Columns: 8
## $ year                   <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ month                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,…
## $ date                   <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01…
## $ state                  <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK,…
## $ county                 <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair…
## $ county_code            <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001…
## $ monthly_adj_deaths     <dbl> 16, 20, 15, 14, 1, 11, 10, 17, 1, 11, 1, 17, 1,…
## $ monthly_crude_rate_10k <dbl> 7.2457205, 9.0571506, 6.7928630, 6.3400054, 0.4…
df_annual %>% glimpse()
## Rows: 1,324
## Columns: 7
## $ year                  <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ state                 <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, …
## $ county                <fct> Adair, Alfalfa, Atoka, Beaver, Beckham, Blaine, …
## $ county_code           <dbl> 40001, 40003, 40005, 40007, 40009, 40011, 40013,…
## $ county_population     <dbl> 22082, 5754, 13838, 5319, 21709, 9485, 47192, 28…
## $ annual_adj_deaths     <dbl> 156, 39, 88, 38, 145, 79, 322, 258, 707, 349, 31…
## $ annual_crude_rate_10k <dbl> 70.64577, 67.77894, 63.59300, 71.44200, 66.79257…

Suppressed

Early on in EDA it became evident that some replaced Suppressed values were skewing data in counties with low populations. For example, Loving, TX had just a population of 57, but with 5 deaths the crude rate is at a dramatic 877 deaths per 10k people.

df_annual_5 %>%
  select(state, county, county_population, annual_adj_deaths, annual_crude_rate_10k) %>%
  arrange(-annual_crude_rate_10k) %>% 
  slice(1:100)

Histogram

When Suppressed values are replaced with 5, we also see it’s possible to have negative death rates. For example, if there are 5 months in a year with Suppressed COVID hospital deaths, they are all converted to the value 5. The summarised value would then be 25 COVID hospital deaths. The non-suppressed data shows, however, that the total in-hospital deaths over that same year is 10. Since our adjusted death rate subtracts in hospital deaths from COVID deaths we end up with 10 - 25 = -15 in-hospital, non-COVID deaths. Logically, this doesn’t make sense and we could assume the true suppressed values are between 1 and 2 for each fo those months.

This histogram below illustrates the issue with using a value of 5 for Suppressed values, resulting in negative value as well as dramatically high rates for low population counties.

ggplotly(
  df_annual_5 %>% 
    ggplot(aes(x=annual_crude_rate_10k))+
    geom_histogram(bins=50, fill=NA, color="black")+
    geom_vline(xintercept = 0, color="red")+
    labs(
      title="Distribution of annual crude death rates"
    )
  )

Replace with 1

If we replace values with 1 instead of 5 we see the distribution follows a more normal curve as would be expected. However, we do see that some rates appear to still fall into the negative ranges. These specific examples are again artifacts of our imperfect replacement with 1.

ggplotly(
  df_annual_neg %>%
    ggplot(aes(x=annual_crude_rate_10k))+
    geom_histogram(bins=50, fill=NA, color="black")+
    geom_vline(xintercept = 0, color="red")+
    labs(
      title="Better distribution of annual crude death rates"
    )
  )

Because we don’t want to remove any counties from representation we will simply add 3 to the annual_hospital_deaths values so that the rate remains positive. An ideal solution may have been to perform a more advanced imputation with a glm, but this gets us in the right direction.

The rows in question with negative values.

df_annual %>% 
  filter(annual_crude_rate_10k < 0)

Modifying said rows and displaying the final adjusted calculated variables:

ggplotly(
  df_annual %>% 
      ggplot(aes(x=annual_crude_rate_10k))+
      geom_histogram(bins=50, fill=NA, color="black")+
      # geom_vline(xintercept = 0, color="red")+
      labs(
        title="Final distribution of annual crude death rates"
      )
)

We can also see the distribution remains relatively similar over the years for each state.

ggplotly(
  df_annual %>% 
      ggplot(aes(x=annual_crude_rate_10k))+
      geom_histogram(bins=50, fill=NA, color="black")+
      # geom_vline(xintercept = 0, color="red")+
      labs(
        title="Final distribution of annual crude death rates"
      ) +
    facet_wrap(~state+year, nrow = 2)
)

Annually

Initial attempts at visualizing crude death rates over time

We can show that over time the death rates have migrated from one county to another, and that this rate is not correlated to county population.

Joining the map data with crude death rate per 10k people per county, and then binning those values into 3 equal groups, and plotting them on maps.

OK

Starting with Oklahoma

# Annual map data ----
## Oklahoma ----

# join crude rates with map data
df_ok_2021 <- ok_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2021' & state == 'OK') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )
df_ok_2020 <- ok_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2020' & state == 'OK') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )
df_ok_2019 <- ok_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2019' & state == 'OK') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )
df_ok_2018 <- ok_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2018' & state == 'OK') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )

# divide into equal groups by rate
df_ok_2021$group <- as.factor(cut_number(df_ok_2021$rate, 3))
df_ok_2020$group <- as.factor(cut_number(df_ok_2020$rate, 3))
df_ok_2019$group <- as.factor(cut_number(df_ok_2019$rate, 3))
df_ok_2018$group <- as.factor(cut_number(df_ok_2018$rate, 3))

# make plots
gg_ok_2021 <- df_ok_2021 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2021')
gg_ok_2020 <- df_ok_2020 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2020')
gg_ok_2019 <- df_ok_2019 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2019')
gg_ok_2018 <- df_ok_2018 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2018')

# populations
df_ok_2021_pop <- ok_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2021' & state == 'OK') %>% 
      select(county, county_code, county_population) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "population"))
  )

df_ok_2021_pop$group <- as.factor(cut_number(df_ok_2021_pop$population/1000, 3))

gg_ok_2021_pop <- df_ok_2021_pop %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2021 Population')

Population

gg_ok_2021_pop

2021

gg_ok_2021

2020

gg_ok_2020

2019

gg_ok_2019

2018

gg_ok_2018

Table

Data in table format.

df_annual %>% filter(state == 'OK') %>% datatable()

TX

It’s actually quite clear that more rural areas in texas are suffering from a higher crude death rate.

# join crude rates with map data
df_tx_2021 <- tx_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2021' & state == 'TX') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )
df_tx_2020 <- tx_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2020' & state == 'TX') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )
df_tx_2019 <- tx_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2019' & state == 'TX') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )
df_tx_2018 <- tx_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2018' & state == 'TX') %>% 
      select(county, county_code, annual_crude_rate_10k) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "rate"))
  )

# divide into equal groups by rate
df_tx_2021$group <- as.factor(cut_number(df_tx_2021$rate, 3))
df_tx_2020$group <- as.factor(cut_number(df_tx_2020$rate, 3))
df_tx_2019$group <- as.factor(cut_number(df_tx_2019$rate, 3))
df_tx_2018$group <- as.factor(cut_number(df_tx_2018$rate, 3))

# make plots
gg_tx_2021 <- df_tx_2021 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2021')
gg_tx_2020 <- df_tx_2020 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2020')
gg_tx_2019 <- df_tx_2019 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2019')
gg_tx_2018 <- df_tx_2018 %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2018')

df_tx_2021_pop <- tx_counties %>% 
  left_join(
    df_annual %>% 
      filter(year == '2021' & state == 'TX') %>% 
      select(county, county_code, county_population) %>% 
      mutate(county_code = as.character(county_code),
             county = as.character(county)) %>% 
      set_colnames(c("NAME", "GEOID", "population"))
  )

df_tx_2021_pop$group <- cut_number(df_tx_2021_pop$population/1000, 3)

gg_tx_2021_pop <- df_tx_2021_pop %>% 
  ggplot(aes(fill=group))+
  geom_sf() + 
  theme_void() + 
  scale_fill_manual(values=c("#FCE8B2", "#FFB74D", "#E65100"))+
  ggtitle('2021 Population')

Population

gg_tx_2021_pop

2021

gg_tx_2021

2020

gg_tx_2020

2019

gg_tx_2019

2018

gg_tx_2018

Table

Data in table format.

df_annual %>% filter(state == 'TX') %>% datatable()

Monthly

Visualizing monthly data in map format over 12 months would be visually messy so we can display just a few counties of interest and show how they have changed over the months for the year 2021.

OK

We can start by focusing on counties with medium or higher death rates. This is done by taking the mean death rate for the year 2021 and taking the top counties.

# top counties to plot
num_counties = 16

# join with populations
df_a <- df_monthly %>% 
  filter(state == "OK" & year == "2021") %>% 
  left_join(df_annual %>% select(county_code, year, county_population)) %>% 
  select(date, county, county_code, county_population, monthly_crude_rate_10k)

# get mean death rate per county for later filtering
top_OK <- df_a %>% 
  group_by(county) %>% 
  summarise(mean_rate = mean(monthly_crude_rate_10k)) %>% 
  arrange(-mean_rate) %>% 
  slice(1:num_counties)

top_OK

We can then add some color coding representing high, medium, and low death rates and then plot how each county is impacted over time.

# group based on annual death rate
df_monthly$groups <- cut_number(df_monthly$monthly_crude_rate_10k, 3)
# df_annualtop_10_OK$groups <- cut_number(df_annual$annual_crude_rate_10k, 3)

# join to counties
# df_a <- df_a %>% left_join(df_annual)
df_a <- df_a %>% left_join(df_monthly)

df_a <- df_a %>% 
  mutate(group = as.factor(case_when(
    groups == levels(df_a$groups)[1] ~ "low",
    groups == levels(df_a$groups)[2] ~ "med",
    groups == levels(df_a$groups)[3] ~ "high"
  )))

df_a %>% 
  filter(county %in% top_OK$county) %>% 
  ggplot(aes(x=date, y=monthly_crude_rate_10k, group=county, fill=group, color=group)) +
  geom_line()+
  geom_point()+
  ggtitle("Crude death rate 2021, OK") +
  scale_x_date(date_breaks = "3 months", date_labels =  "%b") +
  facet_wrap(~county)

TX

We can start by focusing on counties with medium or higher death rates. This is done by taking the mean death rate for the year 2021 and taking the top counties.

# join with populations
df_b <- df_monthly %>% 
  filter(state == "TX" & year == "2021") %>% 
  left_join(df_annual %>% select(county_code, year, county_population)) %>% 
  select(date, county, county_code, county_population, monthly_crude_rate_10k)

# get mean death rate per county for later filtering
top_TX <- df_b %>% 
  group_by(county) %>% 
  summarise(mean_rate = mean(monthly_crude_rate_10k)) %>% 
  arrange(-mean_rate) %>% 
  filter(county != "Loving") %>%  # remove outlier
  slice(1:num_counties)

top_TX

We can then add some color coding representing high, medium, and low death rates and then plot how each county is impacted over time.

# group based on annual death rate
df_monthly$groups <- cut_number(df_monthly$monthly_crude_rate_10k, 3)
# df_annualtop_10_OK$groups <- cut_number(df_annual$annual_crude_rate_10k, 3)

# join to counties
# df_b<- df_b%>% left_join(df_annual)
df_b <- df_b %>% left_join(df_monthly)

df_b <- df_b%>% 
  mutate(group = as.factor(case_when(
    groups == levels(df_b$groups)[1] ~ "low",
    groups == levels(df_b$groups)[2] ~ "med",
    groups == levels(df_b$groups)[3] ~ "high"
  )))

df_b %>% 
  filter(county %in% top_TX$county) %>% 
  ggplot(aes(x=date, y=monthly_crude_rate_10k, group=county, fill=group, color=group)) +
  geom_line()+
  geom_point()+
  ggtitle("Crude death rate 2021, TX") +
  scale_x_date(date_breaks = "3 months", date_labels =  "%b") +
  theme(axis.text.x=element_text(angle=60, hjust=1))+
  facet_wrap(~county)